1. Introduction


As materials science and other communities become increasingly data-driven, it is
important to be able to quantify the similarity or difference between datasets.
Datasets can take any number of forms: scalar values, 1-dimensional (1-D) vectors,
2-dimensional (2-D) matrices, n-dimensional matrices, etc. Mathematically, there are
a number of measures for assessing the similarity or dissimilarity (distance) between
these forms of datasets.1 For instance, given datasets X and Y, one common way of
representing their distance is the L1-norm (i.e., the sum of the absolute differences
between corresponding data points, $\sum_i |X_i - Y_i|$, which is related to the mean
absolute error). The Euclidean distance, or L2-norm, is the square root of the sum of
the squared differences between corresponding data points
($\sqrt{\sum_i (X_i - Y_i)^2}$), which is related to the root mean squared error. Yet
another measure is the L∞-norm, or Chebyshev norm, which is the maximum distance
between the 2 datasets
($\lim_{p \to \infty} \left(\sum_i |X_i - Y_i|^p\right)^{1/p} = \max_i |X_i - Y_i|$).
These are just a few distance examples related to the Lp Minkowski norms and are by no
means exhaustive of the ways to characterize the similarity or difference between 2
datasets. There are a large number of possible similarity and distance measures that
can be applied to different datasets.
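The three norms above can be sketched in a few lines of NumPy. This is only an illustrative sketch: the function name `minkowski_distance` and the example vectors are assumptions made here and are not taken from the appended codes.

```python
import numpy as np

def minkowski_distance(x, y, p):
    """L_p (Minkowski) distance between two equal-length 1-D arrays."""
    diff = np.abs(np.asarray(x, dtype=float) - np.asarray(y, dtype=float))
    if np.isinf(p):
        return float(diff.max())            # Chebyshev limit, p -> infinity
    return float((diff ** p).sum() ** (1.0 / p))

# Hypothetical example data (not from the report)
x = np.array([0.1, 0.4, 0.5])
y = np.array([0.2, 0.2, 0.6])

d1 = minkowski_distance(x, y, 1)            # city block, L1-norm
d2 = minkowski_distance(x, y, 2)            # Euclidean, L2-norm
dinf = minkowski_distance(x, y, np.inf)     # Chebyshev, L-infinity norm
```

For any pair of vectors the three values are ordered as d1 >= d2 >= dinf, because raising p puts progressively more weight on the largest single-bin difference.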

One important property that the metrics discussed herein share is their high
computational efficiency. These metrics perform a bin-to-bin comparison between
datasets, so their cost scales linearly with the number of bins N (i.e., O(N)).
However, the bin-to-bin comparison does have some limitations; most notably, these
metrics do not take the local bin environment into account in quantifying the
distance. Hence, more complex metrics have been developed to overcome the limitations
of bin-to-bin metrics, but these often come with increased computational cost.
Nonetheless, the bin-to-bin metrics described within this technical note are widely
used and may have an important role when computing the distance and similarity of
large datasets and when considering high-throughput processes.
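A small sketch can make the bin-to-bin limitation concrete. Assuming two hypothetical single-peak histograms (the names `ref`, `near`, and `far` are illustrative only), a bin-to-bin metric such as the L1 distance reports the same value whether the peak moves by 1 bin or by 10 bins, because neighboring bins are never compared with each other.

```python
import numpy as np

def l1_distance(x, y):
    """City block distance: one pass over the bins, O(N)."""
    return float(np.abs(x - y).sum())

n_bins = 20
ref = np.zeros(n_bins)
ref[5] = 1.0            # single peak in bin 5
near = np.zeros(n_bins)
near[6] = 1.0           # same peak shifted by 1 bin
far = np.zeros(n_bins)
far[15] = 1.0           # same peak shifted by 10 bins

d_near = l1_distance(ref, near)
d_far = l1_distance(ref, far)
# Both distances are equal: the metric cannot tell a small shift from a large one
# once the peaks no longer overlap.
```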

In this technical note, a number of different measures implemented in both MATLAB
and Python as functions are used to quantify similarity/distance between 2
vector-based datasets, which can be representative of vectors of values being
compared (e.g., histograms, probability distribution functions, signals). The scripts
are attached as appendixes, as is a description of their execution.
2. Function Description


The functions used to compute distance or similarity measures require vectors X
and Y and return the corresponding similarities or distances per the various
measures given hereafter. There are a number of different families of distance and
similarity functions, which are given in Tables 1-9 and are briefly discussed hereafter.


•  The Lp Minkowski family (Table 1) contains measures related to the generalized
formula, $\sqrt[p]{\sum_i |X_i - Y_i|^p}$, where p spans everything from the
city block L1 distance (p = 1) to the Chebyshev distance (p → ∞).

•  The L1 family (Table 2) contains measures related to the absolute difference
L1 distance (i.e., $\sum_i |X_i - Y_i|$).

•  The intersection family (Table 3) contains measures related to the intersection
of X and Y. The $\min(X_i, Y_i)$ or $\max(X_i, Y_i)$ term appears in either the
denominator or numerator for this family.

•  The inner product family (Table 4) contains measures related to the summed
inner product, or dot product, of X and Y (i.e., $\sum_i X_i Y_i$).

•  The fidelity (or squared-chord) family (Table 5) contains measures related to
the sum of the square root of the inner (dot) product (i.e., $\sum_i \sqrt{X_i Y_i}$),
which is referred to as the Fidelity similarity.

•  The squared L2 (or χ²) family (Table 6) contains measures related to the
square of the L2 (Euclidean) norm (i.e., $\sum_i (X_i - Y_i)^2$). The denominator
in some of these measures leads to an asymmetric response if X and Y are
swapped.

•  The Shannon's entropy family (Table 7) contains measures related to Shannon's
concept of probabilistic uncertainty or entropy (e.g., $\sum_i \ln X_i$,
$\sum_i \ln Y_i$, or some similar form).

•  The combination family (Table 8) contains measures that have concepts from
multiple families (e.g., the combined average of the L1 and L∞ norms).

•  The  vicissitude  family  (Table  9)  contains  a  number  of  measures  introduced  in 
Cha.2 
There are some caveats to their implementation,2 most notably errors associated
with division by zero or taking the log of zero.
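One common way to sidestep these errors is to replace zero-valued bins with a very small positive floor before dividing or taking logarithms. The sketch below assumes a floor of 1e-20, the value that appears in the appended codes; the helper names `clamp_zeros` and `kullback_leibler` and the example vectors are hypothetical.

```python
import numpy as np

EPS = 1e-20  # floor value; the appended codes also use 1e-20

def clamp_zeros(v, eps=EPS):
    """Return a float copy of v with entries below eps raised to eps."""
    v = np.array(v, dtype=float)
    v[v < eps] = eps
    return v

def kullback_leibler(x, y):
    """Kullback-Leibler distance with zero bins floored to EPS."""
    x, y = clamp_zeros(x), clamp_zeros(y)
    return float(np.sum(x * np.log(x / y)))

# Hypothetical histograms that each contain an empty bin
x = [0.5, 0.5, 0.0]
y = [0.5, 0.0, 0.5]
d_kl = kullback_leibler(x, y)   # finite, instead of inf or nan
```

The result is finite, but its magnitude is dominated by the floored bins, so values obtained this way depend on the chosen floor and should be compared only among results computed with the same floor.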
Table 1  Lp Minkowski family

City block, L1-norm      $d_{City} = \sum_i |X_i - Y_i|$                       (1)
Euclidean, L2-norm       $d_{Eucl} = \sqrt{\sum_i (X_i - Y_i)^2}$              (2)
Minkowski, Lp-norm       $d_{Mink} = \sqrt[p]{\sum_i |X_i - Y_i|^p}$           (3)
Chebyshev, L∞-norm       $d_{Cheb} = \max_i |X_i - Y_i|$                       (4)

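Table 2  L1 family

Sørensen       $d_{Sor} = \sum_i |X_i - Y_i| / \sum_i (X_i + Y_i)$             (5)
Gower          $d_{Gow} = \sum_i |X_i - Y_i| / b$                              (6)
Soergel        $d_{Sg} = \sum_i |X_i - Y_i| / \sum_i \max(X_i, Y_i)$           (7)
Kulczynski     $d_{Kul} = \sum_i |X_i - Y_i| / \sum_i \min(X_i, Y_i)$          (8)
Canberra       $d_{Canb} = \sum_i |X_i - Y_i| / (X_i + Y_i)$                   (9)
Lorentzian     $d_{Lor} = \sum_i \ln(1 + |X_i - Y_i|)$                         (10)

In Eq. 6, b is the number of bins (the length of X and Y), as used in the appended codes.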

Approved  for  public  release;  distribution  is  unlimited. 


Table 3  Intersection family

Intersection     $d_{IS} = \sum_i |X_i - Y_i| / 2$                                         (11)
Wave Hedges      $d_{WH} = \sum_i |X_i - Y_i| / \max(X_i, Y_i)$                            (12)
Czekanowski      $d_{Cz} = \sum_i |X_i - Y_i| / \sum_i (X_i + Y_i)$                        (13)
Motyka           $d_{Mo} = \sum_i \max(X_i, Y_i) / \sum_i (X_i + Y_i)$                     (14)
Tanimoto         $d_{Ta} = \sum_i [\max(X_i, Y_i) - \min(X_i, Y_i)] / \sum_i \max(X_i, Y_i)$   (15)
Intersection     $s_{IS} = \sum_i \min(X_i, Y_i)$                                          (16)
Czekanowski      $s_{Cz} = 2 \sum_i \min(X_i, Y_i) / \sum_i (X_i + Y_i)$                   (17)
Motyka           $s_{Mo} = \sum_i \min(X_i, Y_i) / \sum_i (X_i + Y_i)$                     (18)
Ruzicka          $s_{Ru} = \sum_i \min(X_i, Y_i) / \sum_i \max(X_i, Y_i)$                  (19)
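Table 4  Inner product family

Inner product        $s_{IP} = \sum_i X_i Y_i$                                               (20)
Harmonic mean        $s_{HM} = 2 \sum_i X_i Y_i / (X_i + Y_i)$                               (21)
Cosine               $s_{Cos} = \sum_i X_i Y_i / (\sqrt{\sum_i X_i^2} \sqrt{\sum_i Y_i^2})$  (22)
Kumar-Hassebrook     $s_{KH} = \sum_i X_i Y_i / \sum_i (X_i^2 + Y_i^2 - X_i Y_i)$            (23)
Jaccard              $s_{Ja} = \sum_i X_i Y_i / \sum_i (X_i^2 + Y_i^2 - X_i Y_i)$            (24)
Jaccard              $d_{Ja} = \sum_i (X_i - Y_i)^2 / \sum_i (X_i^2 + Y_i^2 - X_i Y_i)$      (25)
Dice                 $s_{Di} = 2 \sum_i X_i Y_i / \sum_i (X_i^2 + Y_i^2)$                    (26)
Dice                 $d_{Di} = \sum_i (X_i - Y_i)^2 / \sum_i (X_i^2 + Y_i^2)$                (27)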

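Table 5  Fidelity family

Fidelity          $s_{Fid} = \sum_i \sqrt{X_i Y_i}$                            (28)
Bhattacharyya     $d_{Ba} = -\ln \sum_i \sqrt{X_i Y_i}$                        (29)
Hellinger         $d_{He} = \sqrt{2 \sum_i (\sqrt{X_i} - \sqrt{Y_i})^2}$       (30)
Matusita          $d_{Ma} = \sqrt{\sum_i (\sqrt{X_i} - \sqrt{Y_i})^2}$         (31)
Squared-chord     $d_{SC} = \sum_i (\sqrt{X_i} - \sqrt{Y_i})^2$                (32)
Squared-chord     $s_{SC} = 2 \sum_i \sqrt{X_i Y_i} - 1$                       (33)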
(33) 
Table 6  The χ² family

Squared Euclidean              $d_{SE} = \sum_i (X_i - Y_i)^2$                                 (34)
Pearson χ²                     $d_{Pea} = \sum_i (X_i - Y_i)^2 / Y_i$                          (35)
Neyman χ²                      $d_{Ney} = \sum_i (X_i - Y_i)^2 / X_i$                          (36)
Squared χ²                     $d_{Sq\chi} = \sum_i (X_i - Y_i)^2 / (X_i + Y_i)$               (37)
Probabilistic Symmetric χ²     $d_{Sy\chi} = 2 \sum_i (X_i - Y_i)^2 / (X_i + Y_i)$             (38)
Divergence                     $d_{Div} = 2 \sum_i (X_i - Y_i)^2 / (X_i + Y_i)^2$              (39)
Clark                          $d_{Cl} = \sqrt{\sum_i [|X_i - Y_i| / (X_i + Y_i)]^2}$          (40)
Additive Symmetric χ²          $d_{Ad\chi} = \sum_i (X_i - Y_i)^2 (X_i + Y_i) / (X_i Y_i)$     (41)
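The asymmetry mentioned for this family can be checked numerically. The sketch below uses hypothetical function names and example vectors; it shows that swapping X and Y changes the Pearson χ² value (and turns it into the Neyman χ² value), while the squared χ² distance is unchanged by the swap.

```python
import numpy as np

def pearson_chi2(x, y):
    """Eq. 35: squared differences weighted by 1/Y_i."""
    return float(np.sum((x - y) ** 2 / y))

def neyman_chi2(x, y):
    """Eq. 36: squared differences weighted by 1/X_i."""
    return float(np.sum((x - y) ** 2 / x))

def squared_chi2(x, y):
    """Eq. 37: squared differences weighted by 1/(X_i + Y_i)."""
    return float(np.sum((x - y) ** 2 / (x + y)))

# Hypothetical normalized histograms with no empty bins
x = np.array([0.7, 0.2, 0.1])
y = np.array([0.4, 0.4, 0.2])

pearson_xy = pearson_chi2(x, y)   # differs from pearson_chi2(y, x)
pearson_yx = pearson_chi2(y, x)   # equals neyman_chi2(x, y)
```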
Table 7  Shannon's entropy family

Kullback-Leibler      $d_{KL} = \sum_i X_i \ln(X_i / Y_i)$                                                        (42)
Jeffreys              $d_{Jef} = \sum_i (X_i - Y_i) \ln(X_i / Y_i)$                                               (43)
K divergence          $d_{Kdv} = \sum_i X_i \ln(2 X_i / (X_i + Y_i))$                                             (44)
Topsøe                $d_{Top} = \sum_i [X_i \ln(2 X_i / (X_i + Y_i)) + Y_i \ln(2 Y_i / (X_i + Y_i))]$            (45)
Jensen-Shannon        $d_{JS} = \sum_i [X_i \ln(2 X_i / (X_i + Y_i)) + Y_i \ln(2 Y_i / (X_i + Y_i))] / 2$         (46)
Jensen difference     $d_{Jdf} = \sum_i [(X_i \ln X_i + Y_i \ln Y_i) / 2 - ((X_i + Y_i) / 2) \ln((X_i + Y_i) / 2)]$   (47)

Table 8  Combination family

Taneja             $d_{Tan} = \sum_i ((X_i + Y_i) / 2) \ln((X_i + Y_i) / (2 \sqrt{X_i Y_i}))$   (48)
Kumar-Johnson      $d_{KJ} = \sum_i (X_i^2 - Y_i^2)^2 / (2 (X_i Y_i)^{3/2})$                    (49)
Average(L1,L∞)     $d_{avL} = (\sum_i |X_i - Y_i| + \max_i |X_i - Y_i|) / 2$                    (50)


Table 9  Vicissitude family

Vicis-Wave Hedges      $d_{VWH} = \sum_i |X_i - Y_i| / \min(X_i, Y_i)$                                    (51)
Vicis-Symmetric χ²     $d_{VS\chi 1} = \sum_i (X_i - Y_i)^2 / \min(X_i, Y_i)^2$                           (52)
Vicis-Symmetric χ²     $d_{VS\chi 2} = \sum_i (X_i - Y_i)^2 / \min(X_i, Y_i)$                             (53)
Vicis-Symmetric χ²     $d_{VS\chi 3} = \sum_i (X_i - Y_i)^2 / \max(X_i, Y_i)$                             (54)
max-Symmetric χ²       $d_{MaxS} = \max(\sum_i (X_i - Y_i)^2 / X_i, \sum_i (X_i - Y_i)^2 / Y_i)$          (55)
min-Symmetric χ²       $d_{MinS} = \min(\sum_i (X_i - Y_i)^2 / X_i, \sum_i (X_i - Y_i)^2 / Y_i)$          (56)

Figure 1 shows an example of the general form of the similarity/distance measures
as applied to Gaussian peaks. In this case, the original (Gaussian) peaks X are
compared with the modified peaks Y using 4 different forms within the similarity
metrics: L1-norm, intersection, inner product, and Shannon entropy. First, the scalar
similarity value $s_i$ is computed as a function of bin value for the 2 top rows of
peaks (X and Y, respectively). Next (not shown), the individual $s_i$ are summed over
all i to produce a single scalar value of similarity, $s = \sum_i s_i(X_i, Y_i)$.
Interestingly, different similarity metrics may accentuate different characteristics
within the data signal (i.e., peaks in this case). For example, notice how the Shannon
entropy accentuates peak shift and broadening but is comparatively insensitive to the
other peak modifications shown in Fig. 1.


[Figure 1 image. Recoverable panel labels: Original Peaks; Split; X; Y.]

Fig. 1  Example of how different families of metrics are influenced by minor deviations in
peak position, peak broadening, peak splitting, and noise (modified peaks, from left to right)
for 4 Gaussian curves. The original peaks (top, in green) are compared to the modified peaks
(second from top, in green) for the L1 norm metric, the intersection metric, the inner product
metric, and the Shannon entropy metrics (in red).
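The two-step procedure described for Fig. 1 (a per-bin value $s_i$ followed by a sum over bins) can be sketched as follows. The peak shape, the shift of 0.5, and the use of the Kullback-Leibler term as the representative Shannon entropy measure are assumptions made for illustration; the report does not state which entropy measure or which peak parameters were used in the figure.

```python
import numpy as np

def gaussian(t, center, width):
    """Gaussian peak sampled on t and normalized so the bins sum to 1."""
    g = np.exp(-0.5 * ((t - center) / width) ** 2)
    return g / g.sum()

t = np.linspace(-5.0, 5.0, 201)
x = gaussian(t, 0.0, 1.0)          # original peak
y = gaussian(t, 0.5, 1.0)          # hypothetical shifted peak

eps = 1e-20                        # floor to avoid log(0) and division by zero
xs, ys = np.maximum(x, eps), np.maximum(y, eps)

# Step 1: per-bin contributions s_i for four representative measures
per_bin = {
    "L1": np.abs(x - y),
    "intersection": np.minimum(x, y),
    "inner product": x * y,
    "Kullback-Leibler": xs * np.log(xs / ys),
}

# Step 2: sum over all bins to obtain one scalar per measure
totals = {name: float(s_i.sum()) for name, s_i in per_bin.items()}
```

Plotting each entry of `per_bin` against `t` gives curves of the kind described for the lower rows of Fig. 1, which show where along the peak each measure picks up its contribution.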
3. Implementation and Usage


The 1-D dataset distance/similarity measurements were implemented in MATLAB and
Python. The Python class can be found in Appendix A and the MATLAB function in
Appendix B. The MATLAB function compares two 1-D vectors of equal size and returns
structured variables with the various similarity and distance metrics. The Python
class provides a framework that compares two 1-D vectors of equal size and returns
the measured distances per family.

The  attached  MATLAB  function  has  been  tested  on  MATLAB  R2014  and  R2015 
on  a  Windows  operating  system.  The  Python  class  has  been  tested  with  Python 
2.7.*  and  numpy  1.11.*  versions  on  RHEL  (6.8)  and  MacOSX  (El  Capitan).  The 
MATLAB  code  can  be  executed  by  completing  the  following: 

•  Download  the  various  scripts  into  the  same  directory: 

-  compute_metrics.m 

•  Open  the  script  in  MATLAB 

•  Type  ‘compute_metrics(X,Y,3)’  at  the  command  prompt  to  run.  The  third 
argument  is  used  for  the  Minkowski  metric  in  the  Lp  family  (i.e.,  p  =  3) 

The Python class can be imported as a module (e.g., import PyDIST as dists) and
used as instructed within the class "DESCRIPTION". The following example was
generated by using the MATLAB scripts. A detailed analysis of the sensitivity of
these metrics and their applicability for quantifying differences in X-ray diffraction
(XRD) features is presented by Hernandez-Rivera et al.3

4.  Examples 

As an example, 3 different XRD patterns (Fig. 2) are compared using the different
distance and similarity metrics. Three different 2θ ranges (full XRD pattern, 18°-25°
range, and 36°-40° range) are used to assess the quantitative difference between
the 3 XRD patterns and to show how much the metrics change as a function of the
range used for the 3 patterns.
Fig. 2  Example of 3 different XRD patterns, which are slightly offset to show the different
peaks. The bottom plots show a magnified view of the master XRD patterns (above). In each
of the cases, the intensity has been normalized such that the area is 1.


Figure 3 shows the different Lp-norm metrics for the full range of the 3 XRD patterns.
The first row/column is the bottom pattern in Fig. 2, the second row/column
is the middle pattern, and the third row/column is the top pattern. Each cell of the
contour plot, at the intersection of a row and a column, gives the distance between
the corresponding 2 XRD patterns. Notice that the intersection of each pattern with
itself has a distance of zero and that the maximum distance is normalized to a value
of one, which makes the different metrics easier to compare. For instance, in Fig. 3,
all metrics suggest that XRD patterns 1 and 2 are furthest apart (i.e., dissimilar),
while patterns 1 and 3 are consistently the most similar.
[Figure 3 image. Recoverable panel titles: City Block, L1-Norm; Euclidean, L2-Norm.]


Fig.  3  Distance  metrics  of  the  Lp  family  computed  for  the  3  XRD  patterns  shown  in  Fig.  2. 
The  diagonal  is  showing  each  XRD  pattern  compared  against  itself  (i.e.,  the  distance  is  zero). 
The  remaining  distance  values  have  been  normalized  such  that  the  maximum  distance  for  each 
metric  is  equal  to  1. 
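The normalized comparison matrices described for Fig. 3 can be sketched as below. The three short vectors are hypothetical stand-ins for area-normalized XRD patterns, and the helper names are illustrative; any of the distance measures in Tables 1-9 could be passed in place of `city_block`.

```python
import numpy as np

def city_block(x, y):
    """L1 distance between two equal-length arrays."""
    return float(np.abs(x - y).sum())

def pairwise_normalized(patterns, metric):
    """Pairwise distance matrix scaled so that its largest entry is 1."""
    n = len(patterns)
    d = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            d[i, j] = metric(patterns[i], patterns[j])
    peak = d.max()
    return d / peak if peak > 0 else d   # leave an all-zero matrix unchanged

# Hypothetical stand-ins for 3 normalized patterns
patterns = [np.array([0.6, 0.3, 0.1]),
            np.array([0.1, 0.3, 0.6]),
            np.array([0.5, 0.3, 0.2])]

dmat = pairwise_normalized(patterns, city_block)
```

The diagonal of `dmat` is zero (each pattern compared against itself), and dividing by the maximum puts matrices from different metrics on a common 0-to-1 scale.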
Figure 4 shows several different metrics from the L1 family for the 3 XRD patterns.
Again, comparing with the distance values from the Lp family in Fig. 3, these metrics
also agree in terms of indicating that patterns 1 and 2 are the most dissimilar as well
as showing that patterns 1 and 3 are the most similar. Interestingly, it can be seen
that certain metrics, such as the Kulczynski distance, are more sensitive to changes
between the 3 patterns (i.e., the lowest normalized distance of 0.64 is much lower
than that of the other metrics).
Fig. 4  Distance metrics of the L1 family computed for the 3 XRD patterns in Fig. 2
Figure 5 shows 4 distance metrics from 4 other families of metrics: intersection,
inner product, fidelity, and squared Euclidean. The different contour plots are for
the 3 different ranges depicted in Fig. 2, with the leftmost plots being for the
full-range XRD patterns. First, the order of similarity/distance between the full-range
patterns is identical for the metrics shown. However, for the 18°-25° range, patterns
2 and 3 are computed to be most dissimilar by the different metrics. Furthermore,
the 36°-40° range has mixed results in terms of quantifying the patterns that are
most dissimilar. In terms of the most similar XRD patterns, though, the restricted
2θ ranges agree with the computed distances for the full range (i.e., patterns 2 and
3 are consistently calculated as the most similar).
Fig. 5  Distance metrics of the intersection, inner product (cosine), fidelity, and squared
Euclidean families computed for the 3 regions in the 3 XRD patterns in Fig. 2. The left contour
map is of the full XRD pattern and the other 2 are for the 2 magnified views of the peaks
(bottom left and bottom right in Fig. 2).
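Restricting a pattern to a 2θ window before computing a metric can be sketched as follows. The function name, the synthetic pattern, and the choice to renormalize the windowed intensity to unit area are all assumptions made for illustration; the report states only that intensities were normalized so that the area is 1.

```python
import numpy as np

def window_and_normalize(two_theta, intensity, lo, hi):
    """Keep points with lo <= 2-theta <= hi and rescale to unit area."""
    mask = (two_theta >= lo) & (two_theta <= hi)
    tt, inten = two_theta[mask], intensity[mask]
    # Trapezoidal-rule area under the windowed curve
    area = np.sum(0.5 * (inten[1:] + inten[:-1]) * np.diff(tt))
    return tt, inten / area

# Hypothetical pattern: flat background plus one peak near 20 degrees
two_theta = np.linspace(10.0, 60.0, 501)
intensity = 1.0 + 50.0 * np.exp(-0.5 * ((two_theta - 20.0) / 0.3) ** 2)

tt, inten = window_and_normalize(two_theta, intensity, 18.0, 25.0)
```

Two patterns sampled on the same 2θ grid and passed through the same window have equal lengths, so they can be compared directly with any of the bin-to-bin metrics.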
5. Summary


It is often important to characterize the similarity or dissimilarity (distance) between
different measured or computed datasets. There are a large number of different
possible similarity and distance measures that can be applied to different datasets. In
this technical note, a number of different measures implemented in both MATLAB
and Python as functions are used to quantify similarity/distance between 2
vector-based datasets. The scripts are attached as appendixes, as is a description of
their execution.

Appendix A. Python Function


This  appendix  appears  in  its  original  form,  without  editorial  change. 
"""
:class:`dist` -- Distance metrics

This module provides a framework to calculate several types of distance
metrics to compare two (N,) arrays.

.. Copyright 2016 Efrain Hernandez-Rivera

Last updated: 2016-09-12 by E. Hernandez-Rivera
"""
import numpy as np
from math import sqrt, log


class Distances(object):
    """Python distance/similarity module. Currently, includes distances
    from Cha [1].

    Parameters
    ----------
    family: name of the distance family
        minkowski
        L1
        inter
        inner
        fidelity
        squaredL2
        shannon
        combination
        vicissitude
    X: array_like
        Reference histogram/distribution
    Y: array_like
        Histogram/distribution to measure distance from X

    Returns
    -------
    distances: dict
        Dictionary containing distances for all the family members as
        outlined by Cha

    Usage
    -----
    >>> import PyDIST as distance
    >>> dist=distance.Distances([1,2,3],[4,6,8])
    >>> mink=dist.minkowski()

    References
    ----------
    [1] Cha, S.H, IJMMMAS, v. 1, iss. 4, pp. 300-307 (2007)
    [2] Hernandez-Rivera, et al. ACS Comb Sci, accepted (2016)
    """

    def __init__(self,P,Q):
        # String exceptions are not valid in Python 2.6+, so raise ValueError
        if sum(P)<1e-20 or sum(Q)<1e-20:
            raise ValueError("One or both vector are zero (empty)...")
        if len(P)!=len(Q):
            raise ValueError("Arrays need to be of equal sizes...")

        #use numpy arrays for efficient coding
        P=np.array(P,dtype=float); Q=np.array(Q,dtype=float)

        #Correct for zero values
        P[np.where(P<1e-20)]=1e-20
        Q[np.where(Q<1e-20)]=1e-20

        self.P=P
        self.Q=Q

    def minkowski(self,n=1):
        P=self.P; Q=self.Q
        return {'Euclidean':sqrt(sum((P-Q)*(P-Q))),\
                'City Block':sum(abs(P-Q)),\
                'Minkowski':(sum(abs(P-Q)**n))**(1./n),\
                'Chebyshev':max(abs(P-Q))}

    def L1(self):
        P=self.P; Q=self.Q; A=sum(abs(P-Q)); d=len(P)
        return {'Sorensen':A/sum(P+Q),\
                'Gower':A/d,\
                'Sorgel':A/sum(np.maximum(P,Q)),\
                'Kulczynski':A/sum(np.minimum(P,Q)),\
                'Canberra':sum(abs(P-Q)/(P+Q)),\
                'Lorentzian':sum(np.log(1+abs(P-Q)))}

    def inter(self):
        P=self.P; Q=self.Q; A=sum(abs(P-Q)); maxPQ=sum(np.maximum(P,Q))
        return {'Intersection':0.5*A,\
                'Wave Hedges':sum(abs(P-Q)/np.maximum(P,Q)),\
                'Czekanowski':A/sum(P+Q),\
                'Motyka':maxPQ/sum(P+Q),\
                'Ruzicka':1-sum(np.minimum(P,Q))/maxPQ,\
                'Tanimoto':sum(np.maximum(P,Q)-np.minimum(P,Q))/maxPQ}


    def inner(self):
        P=self.P; Q=self.Q; ip=sum(P*Q); p2=sum(P*P); q2=sum(Q*Q); d=len(P)
        return {'Inner Product':1-ip,\
                'Harmonic Mean':1-2.*sum(P*Q/(P+Q)),\
                'Cosine':1-ip/(sqrt(p2)*sqrt(q2)),\
                'Jaccard':sum((P-Q)*(P-Q))/(p2+q2-ip),\
                'Dice':sum((P-Q)*(P-Q))/(p2+q2)}

    def fidelity(self):
        # Hellinger and Matusita use closed forms that assume sum(P)=sum(Q)=1
        P=self.P; Q=self.Q; fid=sum(np.sqrt(P*Q))
        return {'Fidelity':1-fid,\
                'Bhattacharyya':-log(fid),\
                'Hellinger':2*sqrt(1-fid),\
                'Matusita':sqrt(2-2*fid),\
                'Squared-Chord':sum((np.sqrt(P)-np.sqrt(Q))**2)}


    def squaredL2(self):
        P=self.P; Q=self.Q; d=len(P)
        return {'Squared Euclidean':sum((P-Q)**2),\
                'Pearson Chi':sum((P-Q)**2/Q),\
                'Neyman Chi':sum((P-Q)**2/P),\
                'Squared Chi':sum((P-Q)**2/(P+Q)),\
                'Prob Symm':2*sum((P-Q)**2/(P+Q)),\
                'Divergence':2*sum((P-Q)**2/(P+Q)**2),\
                'Clark':sqrt(sum((abs(P-Q)/(P+Q))**2)),\
                'Additive Symm':sum((P-Q)**2*(P+Q)/(P*Q))}

    def shannon(self):
        P=self.P; Q=self.Q
        return {'Kull-Leiber':sum(P*np.log(P/Q)),\
                'Jeffreys':sum((P-Q)*np.log(P/Q)),\
                'Kdivergence':sum(P*np.log(2*P/(P+Q))),\
                'Topsoe':sum(P*np.log(2*P/(P+Q))+Q*np.log(2*Q/(P+Q))),\
                'Jensen-Shan':0.5*sum(P*np.log(2*P/(P+Q))\
                                      +Q*np.log(2*Q/(P+Q))),\
                'Jensen-Diff':0.5*sum(P*np.log(P)+Q*np.log(Q)\
                                      -(P+Q)*np.log((P+Q)/2.))}


    def combination(self):
        P=self.P; Q=self.Q
        return {'Taneja':0.5*sum((P+Q)*np.log((P+Q)/(2.*np.sqrt(P*Q)))),\
                'Kumar-John':sum((P*P-Q*Q)**2/(2*(P*Q)**(1.5))),\
                'AverageL':0.5*(sum(abs(P-Q))+max(abs(P-Q)))}

    def vicissitude(self):
        P=self.P; Q=self.Q; p=sum((P-Q)*(P-Q)/P); q=sum((P-Q)*(P-Q)/Q)
        pqmin=np.minimum(P,Q)
        return {'Vicis-Wave Hedge':sum(abs(P-Q)/pqmin),\
                'Vicis-Symm Chi1':sum((P-Q)*(P-Q)/pqmin**2),\
                'Vicis-Symm Chi2':sum((P-Q)*(P-Q)/pqmin),\
                'Vicis-Symm Chi3':sum((P-Q)*(P-Q)/np.maximum(P,Q)),\
                'Max-Symm Chi':max(p,q),\
                'Min-Symm Chi':min(p,q)}
Appendix B. MATLAB Function


This  appendix  appears  in  its  original  form,  without  editorial  change. 
function [D,S,Error] = compute_metrics(P,Q,n)
% MA Tschopp
% Purpose: Implementation of Cha (2007) similarity/distance metrics
% Date: Sept 2016

% For test purposes (kept inside a block comment so that the function does
% not overwrite its inputs or call itself recursively):
%{
P = rand(1,10); P = P/sum(P);
Q = rand(1,10); Q = Q/sum(Q);
P(1) = 0;
Q(2) = 0;
Q(3)=P(3);
P = P/sum(P);
Q = Q/sum(Q);
Q(3)=P(3);
n = 3;
[D1,S1,~] = compute_metrics(P,Q,n)
P(P==0)=1e-20;
Q(Q==0)=1e-20;
P(P==Q)=P(P==Q)+1e-20;
%P = P/sum(P);
%Q = Q/sum(Q);
[D,S,~] = compute_metrics(P,Q,n)
%}

% Correct for divide by zero
P1 = P; Q1 = Q;
P1(P==0)=1e-20;
Q1(Q==0)=1e-20;
P1(P==Q)=P1(P==Q)+1e-20;
D=[]; S=[]; Error = [];

% Lp Minkowski family
D.euclidean = sqrt(sum((P-Q).^2)); %sqrt(dot(P-Q,P-Q));
D.cityblock = sum(abs(P-Q));
D.minkowski = sum(abs(P-Q).^n)^(1/n);
D.chebyshev = max(abs(P-Q));

% L1 family
D.sorensen = sum(abs(P-Q))/sum(P+Q);
D.gower = sum(abs(P-Q))/length(P);
D.soergel = sum(abs(P-Q))/sum(max(P,Q));
if sum(min(P,Q))~=0
    D.kulczynski_d = sum(abs(P-Q))/sum(min(P,Q));
else
    D.kulczynski_d = sum(abs(P1-Q1))/sum(min(P1,Q1));
    Error.kulczynski_d = 'Divide by Zero Error';
end
if min(P+Q)~=0
    D.canberra = sum(abs(P-Q)./(P+Q));
else
    D.canberra = sum(abs(P1-Q1)./(P1+Q1));
    Error.canberra = 'Divide by Zero Error';
end
D.lorentzian = sum(log(1+abs(P-Q)));

% Intersection family
S.intersection = sum(min(P,Q));
D.intersection = 1-sum(min(P,Q));
if min(max(P,Q))~=0
    D.wavehedges = sum(1-min(P,Q)./max(P,Q));
else
    D.wavehedges = sum(1-min(P1,Q1)./max(P1,Q1));
    Error.wavehedges = 'Divide by Zero Error';
end
S.czekanowski = 2*sum(min(P,Q))/sum(P+Q);
D.czekanowski = 1-2.*sum(min(P,Q))/sum(P+Q);
S.motyka = sum(min(P,Q))/sum(P+Q);
D.motyka = 1-sum(min(P,Q))/sum(P+Q);
if sum(abs(P-Q))~=0
    S.kulczynski_s = sum(min(P,Q))/sum(abs(P-Q));
else
    S.kulczynski_s = sum(min(P1,Q1))/sum(abs(P1-Q1));
    Error.kulczynski_s = 'Divide by Zero Error';
end
if sum(min(P,Q))~=0
    D.kulczynski_s = 1/S.kulczynski_s;
else
    D.kulczynski_s = 1/(sum(min(P1,Q1))/sum(abs(P1-Q1)));
    Error.kulczynski_s = 'Divide by Zero Error';
end
S.ruzicka = sum(min(P,Q))/sum(max(P,Q));
D.ruzicka = 1-S.ruzicka;
D.tanimoto = (sum(P)+sum(Q)-2*sum(min(P,Q)))/(sum(P)+sum(Q)-sum(min(P,Q)));

% Inner product family
S.inner = dot(P,Q);
D.inner = 1-S.inner;
if min(P+Q)~=0
    S.harmonic = 2*sum((P.*Q)./(P+Q));
    D.harmonic = 1-S.harmonic;
else
    S.harmonic = 2*sum((P1.*Q1)./(P1+Q1));
    D.harmonic = 1-S.harmonic;
    Error.harmonic = 'Divide by Zero Error';
end
S.cosine = dot(P,Q)/sqrt(dot(P,P)*dot(Q,Q));
D.cosine = 1-S.cosine;
S.kumarh = dot(P,Q)/(dot(P,P)+dot(Q,Q)-dot(P,Q));
D.kumarh = 1-S.kumarh;
S.jaccard = dot(P,Q)/(dot(P,P)+dot(Q,Q)-dot(P,Q));
D.jaccard = dot(P-Q,P-Q)/(dot(P,P)+dot(Q,Q)-dot(P,Q));
S.dice = 2*dot(P,Q)/(dot(P,P)+dot(Q,Q));
D.dice = 1-2*dot(P,Q)/(dot(P,P)+dot(Q,Q));

% Fidelity (square chord) family
S.fidelity = sum(sqrt(P.*Q));
D.fidelity = 1-S.fidelity;
if sum(sqrt(P.*Q))~=0
    D.bhattacharrya = -log(sum(sqrt(P.*Q)));
else
    D.bhattacharrya = -log(sum(sqrt(P1.*Q1)));
    Error.bhattacharrya = 'log(0) Error';
end
D.hellinger = sqrt(2*sum((sqrt(P)-sqrt(Q)).^2));
D.squaredchord = dot(sqrt(P)-sqrt(Q),sqrt(P)-sqrt(Q));
S.squaredchord = 2*sum(sqrt(P.*Q))-1;
D.matusita = sqrt(dot(sqrt(P)-sqrt(Q),sqrt(P)-sqrt(Q)));

% Squared L2 (chi-squared) family
D.squaredeuclidean = dot(P-Q,P-Q);
if min(Q) ~= 0
    D.pearsonchi = sum((P-Q).^2./Q);
else
    D.pearsonchi = sum((P1-Q1).^2./Q1);
    Error.pearsonchi = 'Divide by Zero Error';
end
if min(P) ~= 0
    D.neymanchi = sum((P-Q).^2./P); %=pearsonchi(Q,P)
else
    D.neymanchi = sum((P1-Q1).^2./P1); %=pearsonchi(Q,P)
    Error.neymanchi = 'Divide by Zero Error';
end
if min(P+Q)~=0
    D.squaredchi = sum((P-Q).^2./(P+Q));
else
    D.squaredchi = sum((P1-Q1).^2./(P1+Q1));
    Error.squaredchi = 'Divide by Zero Error';
end
D.probsymm = 2*D.squaredchi;
if min(P+Q)~=0
    D.divergence = 2*sum((P-Q).^2./(P+Q).^2);
    D.clark = sqrt(sum((abs(P-Q)./(P+Q)).^2));
else
    D.divergence = 2*sum((P1-Q1).^2./(P1+Q1).^2);
    D.clark = sqrt(sum((abs(P1-Q1)./(P1+Q1)).^2));
    Error.divergence = 'Divide by Zero Error';
    Error.clark = 'Divide by Zero Error';
end
if min(P.*Q)~=0
    D.additivesymm = sum((P-Q).^2.*(P+Q)./(P.*Q));
else
    D.additivesymm = sum((P1-Q1).^2.*(P1+Q1)./(P1.*Q1));
    Error.additivesymm = 'Divide by Zero Error';
end

% Shannon's entropy family
if min(Q)~=0 && min(P./Q)~=0
    D.kullback_PQ = sum(P.*log(P./Q));
else
    D.kullback_PQ = sum(P1.*log(P1./Q1));
    Error.kullback_PQ = 'Divide by Zero or log(0) Error';
end
if min(P)~=0 && min(Q./P)~=0
    D.kullback_QP = sum(Q.*log(Q./P));
else
    D.kullback_QP = sum(Q1.*log(Q1./P1));
    Error.kullback_QP = 'Divide by Zero or log(0) Error';
end
if min(Q)~=0 && min(P./Q)~=0
    D.jeffreys = sum((P-Q).*log(P./Q));
else
    D.jeffreys = sum((P1-Q1).*log(P1./Q1));
    Error.jeffreys = 'Divide by Zero or log(0) Error';
end
if min(P+Q)~=0 && min(P./(P+Q))~=0
    D.kdivergence = sum(P.*log(2*P./(P+Q)));
else
    D.kdivergence = sum(P1.*log(2*P1./(P1+Q1)));
    Error.kdivergence = 'Divide by Zero or log(0) Error';
end
if min(P+Q)~=0 && min(P./(P+Q))~=0 && min(Q./(P+Q))~=0
    D.topsoe = sum(P.*log(2*P./(P+Q))+Q.*log(2*Q./(P+Q)));
else
    D.topsoe = sum(P1.*log(2*P1./(P1+Q1))+Q1.*log(2*Q1./(P1+Q1)));
    Error.topsoe = 'Divide by Zero or log(0) Error';
end

% Jensen-Shannon is half the Topsoe distance, per Eq. (46). The listing
% previously averaged the two Kullback-Leibler distances, which is half
% the Jeffreys distance rather than the Jensen-Shannon distance.
if min(P+Q)~=0 && min(P./(P+Q))~=0 && min(Q./(P+Q))~=0
    D.jensen_s = 0.5*D.topsoe;
else
    D.jensen_s = 0.5*D.topsoe;
    Error.jensen_s = 'Divide by Zero or log(0) Error';
end
if min(P+Q)~=0 && min(P)~=0 && min(Q)~=0
    D.jensen_d = sum(0.5*(P.*log(P)+Q.*log(Q)-(P+Q).*log(0.5*(P+Q))));
else
    D.jensen_d = sum(0.5*(P1.*log(P1)+Q1.*log(Q1)-(P1+Q1).*log(0.5*(P1+Q1))));
    Error.jensen_d = 'log(0) Error';
end

% Combinations
% Taneja uses the elementwise sqrt(P.*Q), per Eq. (48); the listing
% previously used the scalar sqrt(dot(P,Q)).
if min(P.*Q)~=0
    D.taneja = sum(0.5*(P+Q).*log(0.5*(P+Q)./sqrt(P.*Q)));
else
    D.taneja = sum(0.5*(P1+Q1).*log(0.5*(P1+Q1)./sqrt(P1.*Q1)));
    Error.taneja = 'Divide by Zero or log(0) Error';
end
if min(P.*Q)~=0
    D.kumarj = 0.5*sum((P.^2-Q.^2).^2./(P.*Q).^(3/2));
else
    D.kumarj = 0.5*sum((P1.^2-Q1.^2).^2./(P1.*Q1).^(3/2));
    Error.kumarj = 'Divide by Zero Error';
end
D.avgL = 0.5*(sum(abs(P-Q))+max(abs(P-Q)));

% Vicissitude
if min(min(P,Q))~=0
    D.viciswave = sum(abs(P-Q)./min(P,Q));
    D.vicissymm1 = sum((P-Q).^2./min(P,Q).^2);
    D.vicissymm2 = sum((P-Q).^2./min(P,Q));
else
    D.viciswave = sum(abs(P1-Q1)./min(P1,Q1));
    D.vicissymm1 = sum((P1-Q1).^2./min(P1,Q1).^2);
    D.vicissymm2 = sum((P1-Q1).^2./min(P1,Q1));
    Error.viciswave = 'Divide by Zero Error';
    Error.vicissymm1 = 'Divide by Zero Error';
    Error.vicissymm2 = 'Divide by Zero Error';
end
if min(max(P,Q))~=0
    D.vicissymm3 = sum((P-Q).^2./max(P,Q));
else
    D.vicissymm3 = sum((P1-Q1).^2./max(P1,Q1));
    Error.vicissymm3 = 'Divide by Zero Error';
end
if min(Q) ~= 0 && min(P) ~= 0
    D.maxsymm = max(D.pearsonchi,D.neymanchi);
    D.minsymm = min(D.pearsonchi,D.neymanchi);
else
    D.maxsymm = max(D.pearsonchi,D.neymanchi);
    D.minsymm = min(D.pearsonchi,D.neymanchi);
    Error.maxsymm = 'Divide by Zero Error';
    Error.minsymm = 'Divide by Zero Error';
end

end